Fix manual vLLM Qwen3 sharding bug when trainer export the weights #229

JenniferWang · 2025-09-24T18:51:54Z

Thanks to @casteryh's PR fixing the integration test (#217), we can now see the buggy behavior of the manual sharding logic.

This diff shows why the test failed and how easy it is to introduce silent data-correctness issues if we keep pursuing manual sharding.

The plan is to land these two PRs ASAP:
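As a hedged illustration of how manual tensor-parallel sharding can go silently wrong, consider a fused QKV weight. This is a toy sketch and not necessarily the bug fixed in this PR: the function names, the labelled-row representation, and the equal-sized Q/K/V blocks are all assumptions made for the example. Chunking the fused tensor contiguously yields a shard of the right shape that holds the wrong rows, so no shape check ever fires.

```python
# Illustrative only: names and sizes are hypothetical, not taken from this PR.
# Rows are represented as labels ("q0", "k1", ...) so misplacement is visible.

def make_fused_qkv(rows_per_proj):
    """Build a fused [Q; K; V] weight as a flat list of labelled rows."""
    fused = []
    for name in ("q", "k", "v"):
        fused.extend(f"{name}{i}" for i in range(rows_per_proj))
    return fused


def naive_shard(fused, rank, tp):
    """Contiguous chunking of the fused tensor: right size, wrong rows."""
    chunk = len(fused) // tp
    return fused[rank * chunk:(rank + 1) * chunk]


def per_projection_shard(fused, rank, tp):
    """Slice Q, K and V separately, then re-fuse the local slices."""
    rows_per_proj = len(fused) // 3  # assumes equal-sized Q, K, V blocks
    local = rows_per_proj // tp
    shard = []
    for p in range(3):
        start = p * rows_per_proj + rank * local
        shard.extend(fused[start:start + local])
    return shard


fused = make_fused_qkv(rows_per_proj=4)
naive = naive_shard(fused, rank=0, tp=2)
correct = per_projection_shard(fused, rank=0, tp=2)

print(naive)    # ['q0', 'q1', 'q2', 'q3', 'k0', 'k1']
print(correct)  # ['q0', 'q1', 'k0', 'k1', 'v0', 'v1']
```

Both shards have six rows, so a shape-only check passes for either; only a comparison of values, as an end-to-end weight-sync test does, can tell them apart.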

Fix the broken test: integration test for weight sync that actually tests behavior #217
Replace the manual sharding with vanilla load weight calls: Weight loading working correctly with tp: use vllm builtin load_weights() #184

At the same time, this PR fixes the existing Qwen3 (non MoE) sharding.

Before the fix:

test_policy_update integration test fails when tp > 1
grpo (Qwen 1.7B) loss is very high at the start of training

https://meta.wandb.io/jiyue/grpo-training/runs/hh3bht2w?nw=nwuserjiyue

After the fix:

test_policy_update integration test passes when tp > 1
grpo (Qwen 1.7B) loss looks much more reasonable

https://meta.wandb.io/jiyue/grpo-training/runs/gnvlw7dc?nw=nwuserjiyue

allenwang28 · 2025-09-24T23:22:36Z

@JenniferWang life pro tip: you should be able to run pre-commit install (assuming you've already run pip install pre-commit), which will run the linter for you automatically every time you do a git commit.

meta-cla bot added the CLA Signed label Sep 24, 2025

JenniferWang linked an issue Sep 24, 2025 that may be closed by this pull request

Give a better solution to manually calculated vLLM sharding logic in sharding.py #174

Closed

JenniferWang added 5 commits September 25, 2025 22:01

demonstrate the sharding bug when trainer export the weights

15bf741

fix attempt

656aca3

update main grpo app

a7bc8c3

nit

60caca7

format

9547532

JenniferWang force-pushed the vllm-sharding-bug branch from 33d0ec1 to 9547532 Compare September 26, 2025 13:19

JenniferWang added 2 commits September 26, 2025 09:25

nit

d81f3ba

fix

763b540

JenniferWang changed the title ~~Demonstrate the vLLM sharding bug when trainer export the weights~~ Fix manual vLLM Qwen3 sharding bug when trainer export the weights Sep 26, 2025

JenniferWang added 3 commits September 26, 2025 09:33

nit

5fbfd40

nit

31cf553

another callsites

12f809f

JenniferWang force-pushed the vllm-sharding-bug branch from 9fdc1fa to 12f809f Compare September 26, 2025 13:45

format

4f69504

joecummings approved these changes Sep 26, 2025

format

133f372

JenniferWang merged commit 825596b into meta-pytorch:main Sep 26, 2025
5 checks passed
